We will use the following libraries throughout this analysis:
import pandas as pd
import json
import numpy as np
import plotly.express as px
from IPython.display import display, Markdown
import scipy.stats as sp
In this data science paper, I analyze the results of the 2020 presidential election in Poland. The results are taken from the website of the National Electoral Commission (Polish: Państwowa Komisja Wyborcza) – link.
In this election, voters could choose among the following candidates:
candidates_df = pd.read_csv('data/candidates/candidates.csv', sep=';')
candidates_df.set_index('Pozycja na karcie', inplace=True)
candidates_df
We generate a list of these candidates to filter their results later on.
candidates_first_name = candidates_df['Imiona'].tolist()
candidates_last_name = candidates_df['Nazwisko'].tolist()
candidates = np.array(
    list(zip(candidates_first_name, candidates_last_name))
)
candidates = [candidate[0] + ' ' + candidate[1] for candidate in candidates]
candidates
Since the full-name format is too long to display well in charts, we create a function that returns only the last name in title case.
def get_last_name(full_name):
    """Get the last name from full name."""
    return full_name.split(' ')[-1].title()
get_last_name('Władysław Marcin KOSINIAK-KAMYSZ')
To differentiate maps of the candidates, we will use colors associated with their campaign.
candidates_colors = {
    'Robert BIEDROŃ': px.colors.sequential.Reds,
    'Krzysztof BOSAK': px.colors.sequential.Greys,
    'Andrzej Sebastian DUDA': px.colors.sequential.Blues,
    'Szymon Franciszek HOŁOWNIA': px.colors.sequential.YlOrBr,
    'Marek JAKUBIAK': px.colors.sequential.Purples,
    'Władysław Marcin KOSINIAK-KAMYSZ': px.colors.sequential.Greens,
    'Mirosław Mariusz PIOTROWSKI': px.colors.sequential.matter,
    'Paweł Jan TANAJNO': px.colors.sequential.PuBu,
    'Rafał Kazimierz TRZASKOWSKI': px.colors.sequential.Oranges,
    'Waldemar Włodzimierz WITKOWSKI': px.colors.sequential.YlOrRd,
    'Stanisław Józef ŻÓŁTEK': px.colors.sequential.tempo
}
First, we find how many votes each candidate received.
results_counties_df = pd.read_csv('data/results/results_by_county.csv', sep=';')
total_results_df = (
    results_counties_df[candidates].sum().to_frame('Result')
    .sort_values('Result', ascending=False)
    .reset_index()
)
total_results_df.columns = ['Candidate', 'Result']
total_results_df['Candidate'] = total_results_df['Candidate'].apply(get_last_name)
total_results_df
We then display these results as a bar chart.
total_results_fig = px.bar(total_results_df, x='Candidate', y='Result', title='Total number of votes', text='Result')
total_results_fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
total_results_fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
total_results_fig.show()
We also want to see the results as percentages. To get them, we first find the total number of votes.
total_num_of_votes = total_results_df['Result'].sum()
total_num_of_votes
def find_percent(n):
    """Find the percent of n with regards to the total number of votes."""
    return round(n / total_num_of_votes * 100, 2)
total_results_percent_df = pd.concat([total_results_df['Candidate'], total_results_df['Result'].map(find_percent)], axis=1)
total_results_percent_df
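As a side note, the same percentages can be computed without a helper function by dividing the whole column by its sum. The sketch below uses made-up vote counts and hypothetical names (`toy_results_df`, `toy_percent_df`); it only illustrates the vectorized form and is not the notebook's actual data.

```python
import pandas as pd

# Toy vote counts (made-up numbers, not the real election results)
toy_results_df = pd.DataFrame({
    'Candidate': ['A', 'B', 'C'],
    'Result': [500, 300, 200],
})

# Divide the whole column by its sum instead of mapping a function over it
toy_percent = (
    toy_results_df['Result'] / toy_results_df['Result'].sum() * 100
).round(2)

# Replace the raw counts with the percentages, keeping the candidate names
toy_percent_df = toy_results_df.assign(Result=toy_percent)
print(toy_percent_df)
```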
total_results_percent_fig = px.bar(
    total_results_percent_df, x='Candidate', y='Result', title='Total number of votes (in %)', text='Result'
)
total_results_percent_fig.update_traces(texttemplate='%{text}%', textposition='outside')
total_results_percent_fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
total_results_percent_fig.show()
Now, we compare the results of each candidate (given in %) by county. The percentages in the source file use a decimal comma, so we convert them to floats.
results_counties_percent_df = pd.read_csv('data/results/results_by_county_percent.csv', sep=';')
results_counties_percent_df = results_counties_percent_df[['Kod TERYT', 'Powiat'] + candidates]
for candidate in candidates:
    # Convert decimal commas (e.g. '12,34') to floats
    results_counties_percent_df[candidate] = results_counties_percent_df[candidate].apply(lambda x: float(x.replace(',', '.')))
results_counties_percent_df.head()
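The comma-to-dot conversion can also be done while reading the file, since `pd.read_csv` accepts a `decimal` parameter. The sketch below uses a hypothetical two-row sample (the column `Candidate X` and the values are invented) and assumes every value in the column is a plain number written with a decimal comma.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a semicolon-separated file with decimal commas
sample_csv = io.StringIO(
    "Kod TERYT;Powiat;Candidate X\n"
    "20100;bolesławiecki;12,34\n"
    "20200;dzierżoniowski;5,60\n"
)

# decimal=',' makes pandas parse '12,34' as the float 12.34 while reading
sample_df = pd.read_csv(sample_csv, sep=';', decimal=',')
print(sample_df.dtypes)
```

With this approach the per-candidate conversion loop would no longer be needed, provided the real file contains no non-numeric entries in those columns.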
At the same time, we import the geographical data on county borders from the official data of the Head Office of Geodesy and Cartography. The website of GIS Support PL lets us download a package containing only the counties. To create the maps I will use the GeoJSON format. The data from these websites comes with the .shp extension, so I converted it to GeoJSON using MapShaper.
with open('data/geojson/counties.json', encoding='utf-8') as response:
    counties = json.load(response)
counties['features'][0]['properties']
The TERYT code is a unique identifier of each administrative unit. In the election results the code has two extra trailing zeros (00). Additionally, it lacks the leading zero when the voivodeship number has only one digit. We fix both issues so that the two datasets can be joined.
def fix_teryt(teryt):
    """Fix TERYT code to integrate the two datasets."""
    teryt = str(teryt)
    if len(teryt) == 5:
        teryt = '0' + teryt
    return teryt[:-2]
results_counties_percent_df['Kod TERYT'] = results_counties_percent_df['Kod TERYT'].astype(str).map(fix_teryt)
results_counties_percent_df.head()
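To make the transformation concrete, here is a small self-contained check of `fix_teryt` on two illustrative codes in the election-results format. The input values are assumed examples chosen to cover the five-digit and six-digit cases.

```python
def fix_teryt(teryt):
    """Pad the code to six digits, then drop the trailing '00'."""
    teryt = str(teryt)
    if len(teryt) == 5:
        teryt = '0' + teryt
    return teryt[:-2]

# Five digits: the leading zero was lost, so it is restored first
short_code = fix_teryt(20100)
# Six digits: only the trailing '00' is dropped
long_code = fix_teryt(146500)
print(short_code, long_code)

# str.zfill expresses the same padding step in one call
assert str(20100).zfill(6)[:-2] == short_code
```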
This is where the key that joins the two datasets is located in the counties JSON:
counties['features'][0]['properties']['JPT_KOD_JE']
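Before drawing the maps, it can be useful to verify that the codes on both sides of the join actually match. The following sketch compares the two sets of keys; `toy_counties` and `toy_result_codes` are made-up stand-ins that only mimic the assumed shape of `counties` and the fixed `Kod TERYT` column.

```python
# Toy stand-ins for the notebook's objects (assumed shapes, made-up values)
toy_counties = {
    'features': [
        {'properties': {'JPT_KOD_JE': '0201'}},
        {'properties': {'JPT_KOD_JE': '0202'}},
        {'properties': {'JPT_KOD_JE': '1465'}},
    ]
}
toy_result_codes = ['0201', '1465', '9999']

# Collect the join keys from both sides
geojson_codes = {
    feature['properties']['JPT_KOD_JE'] for feature in toy_counties['features']
}
result_codes = set(toy_result_codes)

# Codes present in the results but absent from the map, and vice versa
missing_on_map = sorted(result_codes - geojson_codes)
missing_in_results = sorted(geojson_codes - result_codes)
print(missing_on_map, missing_in_results)
```

Rows that cannot be drawn, such as the ship and abroad entries, would be expected to show up in the first list.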
def get_figure_results_by_county(candidate):
    """Get figure showing a map of results of the given candidate by county."""
    candidate_df = results_counties_percent_df[['Kod TERYT', 'Powiat', candidate]]
    # We remove the results from ships and abroad because they will not be shown on the map
    candidate_df = candidate_df[candidate_df.Powiat != 'statki']
    candidate_df = candidate_df[candidate_df.Powiat != 'zagranica']
    fig = px.choropleth_mapbox(
        candidate_df, geojson=counties, color=candidate,
        locations='Kod TERYT', featureidkey="properties.JPT_KOD_JE",
        center={"lat": 52, "lon": 19.1451},
        opacity=0.8, color_continuous_scale=candidates_colors[candidate],
        hover_data={'Powiat': True, 'Kod TERYT': False},
        mapbox_style="carto-positron", zoom=5.2
    )
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    return fig
for candidate in candidates:
    display(Markdown(f'### Results of {candidate} by county'))
    get_figure_results_by_county(candidate).show()
Looking at these maps, one can see that some candidates have an electorate spread fairly evenly across the whole country, while others have much greater support in particular regions. Which candidate has the most evenly spread electorate? To answer this, we compute the coefficient of variation of each candidate's county-level results.
coefficient_of_variation_df = pd.DataFrame(
    results_counties_percent_df[candidates].apply(sp.variation)
).sort_values(by=0).transpose()
coefficient_of_variation_df
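For reference, the coefficient of variation is the standard deviation divided by the mean, so a lower value means support that is more even relative to its average level. The sketch below checks this on made-up numbers; with its default settings, `scipy.stats.variation` uses the population standard deviation (`ddof=0`), which matches NumPy's default.

```python
import numpy as np
import scipy.stats as sp

# Made-up support percentages for one candidate in four counties
toy_support = np.array([10.0, 12.0, 8.0, 10.0])

# Coefficient of variation computed by hand: std / mean (population std)
cv_manual = toy_support.std() / toy_support.mean()

# The same quantity from SciPy
cv_scipy = sp.variation(toy_support)
print(cv_manual, cv_scipy)
```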
As we see, Krzysztof Bosak has the lowest coefficient of variation, which makes him the most evenly supported candidate in Poland. He is followed by Stanisław Żółtek and Andrzej Duda. Rafał Trzaskowski ranks 8th in this comparison, and Marek Jakubiak is at the end of the list.
The crucial challenge Andrzej Duda and Rafał Trzaskowski will face in the second round is to convince the voters who did not vote for them in the first round. Which counties have the most voters to convince? In other words, which counties should the two candidates focus on most in the campaign?
We first find the number of voters of the other candidates in each county.
candidates_2nd_round = ['Andrzej Sebastian DUDA', 'Rafał Kazimierz TRZASKOWSKI']
candidates_no_2nd_round = [
    candidate
    for candidate in candidates
    if candidate not in candidates_2nd_round
]
candidates_no_2nd_round_df = pd.DataFrame(results_counties_df[candidates_no_2nd_round].sum(axis=1))
candidates_no_2nd_round_df.columns = ['Other electorate']
results_potential_2nd_round_df = pd.concat(
    [results_counties_df[['Powiat', 'Kod TERYT']], candidates_no_2nd_round_df], axis=1
)
results_potential_2nd_round_df['Kod TERYT'] = results_potential_2nd_round_df['Kod TERYT'].astype(str).map(fix_teryt)
results_potential_2nd_round_df.head()
We plot the result on a map.
# We remove the results from ships and abroad because they will not be shown on the map
results_potential_2nd_round_df = results_potential_2nd_round_df[results_potential_2nd_round_df.Powiat != 'statki']
results_potential_2nd_round_df = results_potential_2nd_round_df[results_potential_2nd_round_df.Powiat != 'zagranica']
results_potential_2nd_round_fig = px.choropleth_mapbox(
    results_potential_2nd_round_df, geojson=counties, color='Other electorate',
    locations='Kod TERYT', featureidkey="properties.JPT_KOD_JE",
    center={"lat": 52, "lon": 19.1451},
    opacity=0.8, color_continuous_scale=px.colors.sequential.Reds,
    hover_data={'Powiat': True, 'Kod TERYT': False},
    mapbox_style="carto-positron", zoom=5.2
)
results_potential_2nd_round_fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
results_potential_2nd_round_fig.show()
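In absolute numbers, these voters are mainly in big cities, which simply have more voters overall. It is more informative to look at the other candidates' electorate in relative terms, as a share of the votes in each county.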
candidates_no_2nd_round_percent_df = pd.DataFrame(results_counties_percent_df[candidates_no_2nd_round].sum(axis=1))
candidates_no_2nd_round_percent_df.columns = ['Other electorate [%]']
results_potential_2nd_round_percent_df = pd.concat(
    [results_counties_df[['Powiat', 'Kod TERYT']], candidates_no_2nd_round_percent_df], axis=1
)
results_potential_2nd_round_percent_df['Kod TERYT'] = \
    results_potential_2nd_round_percent_df['Kod TERYT'].astype(str).map(fix_teryt)
# We remove the results from ships and abroad because they will not be shown on the map
results_potential_2nd_round_percent_df = \
    results_potential_2nd_round_percent_df[results_potential_2nd_round_percent_df.Powiat != 'statki']
results_potential_2nd_round_percent_df = \
    results_potential_2nd_round_percent_df[results_potential_2nd_round_percent_df.Powiat != 'zagranica']
results_potential_2nd_round_percent_fig = px.choropleth_mapbox(
    results_potential_2nd_round_percent_df, geojson=counties, color='Other electorate [%]',
    locations='Kod TERYT', featureidkey="properties.JPT_KOD_JE",
    center={"lat": 52, "lon": 19.1451},
    opacity=0.8, color_continuous_scale=px.colors.sequential.Reds,
    hover_data={'Powiat': True, 'Kod TERYT': False},
    mapbox_style="carto-positron", zoom=5.2
)
results_potential_2nd_round_percent_fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
results_potential_2nd_round_percent_fig.show()
The resulting map is somewhat similar to the map of Rafał Trzaskowski's support. This suggests that he may be more likely to gain new voters in the second round, although the visual similarity alone does not prove it.
TODO (tomorrow): PKW has removed the data, so the voivodeship analysis will not be finished until they upload the data again.
It may also be helpful to see who won in larger administrative units: voivodeships. Voivodeships in Poland are based on historical regions, so it is quite common for people from the same voivodeship to share similar values.
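One way to put a number on how similar two maps are is to correlate the underlying county-level columns. The sketch below is only an illustration on invented values: the frame `toy_df` and its numbers are hypothetical, and on the real data one would correlate the 'Other electorate [%]' column with the candidate's percentage column after aligning the rows by county.

```python
import pandas as pd

# Invented county-level percentages (not the real election data)
toy_df = pd.DataFrame({
    'Other electorate [%]': [10.0, 20.0, 30.0, 40.0],
    'Candidate support [%]': [15.0, 25.0, 30.0, 45.0],
})

# Pearson correlation between the two columns; values near 1 mean that
# counties with a high share in one column tend to be high in the other too
similarity = toy_df['Other electorate [%]'].corr(toy_df['Candidate support [%]'])
print(round(similarity, 3))
```

A high correlation would support the visual impression, but it still says nothing about how those voters will actually behave in the second round.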
results_voivodeships_percent_df = pd.read_csv('data/results/results_by_voivodeship.csv', sep=';')
results_voivodeships_percent_df